Goto

Collaborating Authors

 systematic review


Machine learning for violence prediction: a systematic review and critical appraisal

Kozhevnikova, Stefaniya, Yukhnenko, Denis, Scola, Giulio, Fazel, Seena

arXiv.org Artificial Intelligence

Purpose To conduct a systematic review of machine learning models for predicting violent behaviour by synthesising and appraising their validity, usefulness, and performance. Methods We systematically searched nine bibliographic databases and Google Scholar up to September 2025 for development and/or validation studies on machine learning methods for predicting all forms of violent behaviour. We synthesised the results by summarising discrimination and calibration performance statistics and evaluated study quality by examining risk of bias and clinical utility. Results We identified 38 studies reporting the development and validation of 40 models. Most studies reported Area Under the Curve (AUC) as the discrimination statistic with a range of 0.68-0.99. Only eight studies reported calibration performance, and three studies reported external validation. 31 studies had a high risk of bias, mainly in the analysis domain, and three studies had low risk of bias. The overall clinical utility of violence prediction models is poor, as indicated by risks of overfitting due to small samples, lack of transparent reporting, and low generalisability. Conclusion Although black box machine learning models currently have limited applicability in clinical settings, they may show promise for identifying high-risk individuals. We recommend five key considerations for violence prediction modelling: (i) ensuring methodological quality (e.g. following guidelines) and interdisciplinary collaborations; (ii) using black box algorithms only for highly complex data; (iii) incorporating dynamic predictions to allow for risk monitoring; (iv) developing more trustworthy algorithms using explainable methods; and (v) applying causal machine learning approaches where appropriate.


Strategic Innovation Management in the Age of Large Language Models Market Intelligence, Adaptive R&D, and Ethical Governance

Aghaei, Raha, Kiaei, Ali A., Boush, Mahnaz, Rofoosheh, Mahan, Zavvar, Mohammad

arXiv.org Artificial Intelligence

By automating knowledge discovery, boosting hypothesis creation, integrating transdisciplinary insights, and enabling coope ration within innovation ecosystems, LLMs dramatically improve the efficiency and effectiveness of research processes. Through extensive analysis of scientific literature, patent databases, and experimental data, these models enable more flexible and infor med R&D workflows, ultimately accelerating innovation cycles and lowering time - to - market for breakthrough ideas.


A systematic review of relation extraction task since the emergence of Transformers

Celian, Ringwald, Gandon, null, Fabien, null, Catherine, Faron, Franck, Michel, Hanna, Abi Akl

arXiv.org Artificial Intelligence

This article presents a systematic review of relation extraction (RE) research since the advent of Transformer-based models. Using an automated framework to collect and annotate publications, we analyze 34 surveys, 64 datasets, and 104 models published between 2019 and 2024. The review highlights methodological advances, benchmark resources, and the integration of semantic web technologies. By consolidating results across multiple dimensions, the study identifies current trends, limitations, and open challenges, offering researchers and practitioners a comprehensive reference for understanding the evolution and future directions of RE.


Large language models for automated PRISMA 2020 adherence checking

Kataoka, Yuki, So, Ryuhei, Banno, Masahiro, Tsujimoto, Yasushi, Takayama, Tomohiro, Yamagishi, Yosuke, Tsuge, Takahiro, Yamamoto, Norio, Suda, Chiaki, Furukawa, Toshi A.

arXiv.org Artificial Intelligence

Evaluating adherence to PRISMA 2020 guideline remains a burden in the peer review process. To address the lack of shareable benchmarks, we constructed a copyright-aware benchmark of 108 Creative Commons-licensed systematic reviews and evaluated ten large language models (LLMs) across five input formats. In a development cohort, supplying structured PRISMA 2020 checklists (Markdown, JSON, XML, or plain text) yielded 78.7-79.7% accuracy versus 45.21% for manuscript-only input (p less than 0.0001), with no differences between structured formats (p>0.9). Across models, accuracy ranged from 70.6-82.8% with distinct sensitivity-specificity trade-offs, replicated in an independent validation cohort. We then selected Qwen3-Max (a high-sensitivity open-weight model) and extended evaluation to the full dataset (n=120), achieving 95.1% sensitivity and 49.3% specificity. Structured checklist provision substantially improves LLM-based PRISMA assessment, though human expert verification remains essential before editorial decisions.


Evaluating the Ability of Large Language Models to Identify Adherence to CONSORT Reporting Guidelines in Randomized Controlled Trials: A Methodological Evaluation Study

He, Zhichao, Bian, Mouxiao, Zhu, Jianhong, Chen, Jiayuan, Wang, Yunqiu, Zhao, Wenxia, Li, Tianbin, Han, Bing, Xu, Jie, Wu, Junyan

arXiv.org Artificial Intelligence

The Consolidated Standards of Reporting Trials statement is the global benchmark for transparent and high-quality reporting of randomized controlled trials. Manual verification of CONSORT adherence is a laborious, time-intensive process that constitutes a significant bottleneck in peer review and evidence synthesis. This study aimed to systematically evaluate the accuracy and reliability of contemporary LLMs in identifying the adherence of published RCTs to the CONSORT 2010 statement under a zero-shot setting. We constructed a golden standard dataset of 150 published RCTs spanning diverse medical specialties. The primary outcome was the macro-averaged F1-score for the three-class classification task, supplemented by item-wise performance metrics and qualitative error analysis. Overall model performance was modest. The top-performing models, Gemini-2.5-Flash and DeepSeek-R1, achieved nearly identical macro F1 scores of 0.634 and Cohen's Kappa coefficients of 0.280 and 0.282, respectively, indicating only fair agreement with expert consensus. A striking performance disparity was observed across classes: while most models could identify compliant items with high accuracy (F1 score > 0.850), they struggled profoundly with identifying non-compliant and not applicable items, where F1 scores rarely exceeded 0.400. Notably, some high-profile models like GPT-4o underperformed, achieving a macro F1-score of only 0.521. LLMs show potential as preliminary screening assistants for CONSORT checks, capably identifying well-reported items. However, their current inability to reliably detect reporting omissions or methodological flaws makes them unsuitable for replacing human expertise in the critical appraisal of trial quality.


LLM4SCREENLIT: Recommendations on Assessing the Performance of Large Language Models for Screening Literature in Systematic Reviews

Madeyski, Lech, Kitchenham, Barbara, Shepperd, Martin

arXiv.org Artificial Intelligence

Context: Large language models (LLMs) are released faster than users' ability to evaluate them rigorously. When LLMs underpin research, such as identifying relevant literature for systematic reviews (SRs), robust empirical assessment is essential. Objective: We identify and discuss key challenges in assessing LLM performance for selecting relevant literature, identify good (evaluation) practices, and propose recommendations. Method: Using a recent large-scale study as an example, we identify problems with the use of traditional metrics for assessing the performance of Gen-AI tools for identifying relevant literature in SRs. We analyzed 27 additional papers investigating this issue, extracted the performance metrics, and found both good practices and widespread problems, especially with the use and reporting of performance metrics for SR screening. Results: Major weaknesses included: i) a failure to use metrics that are robust to imbalanced data and do not directly indicate whether results are better than chance, e.g., the use of Accuracy, ii) a failure to consider the impact of lost evidence when making claims concerning workload savings, and iii) pervasive failure to report the full confusion matrix (or performance metrics from which it can be reconstructed) which is essential for future meta-analyses. On the positive side, we extract good (evaluation) practices on which our recommendations for researchers and practitioners, as well as policymakers, are built. Conclusions: SR screening evaluations should prioritize lost evidence/recall alongside chance-anchored and cost-sensitive Weighted MCC (WMCC) metric, report complete confusion matrices, treat unclassifiable outputs as referred-back positives for assessment, adopt leakage-aware designs with non-LLM baselines and open artifacts, and ground conclusions in cost-benefit analysis where FNs carry higher penalties than FPs.


Appendix overview

Neural Information Processing Systems

We further focus on measuring the overlap between datasets. We check for the overlap on the level of systematic reviews based on the review's Cochrane ID.



Artificial Intelligence in Elementary STEM Education: A Systematic Review of Current Applications and Future Challenges

Memari, Majid, Ruggles, Krista

arXiv.org Artificial Intelligence

Artificial intelligence (AI) is transforming elementary STEM education, yet evidence remains fragmented. This systematic review synthesizes 258 studies (2020-2025) examining AI applications across eight categories: intelligent tutoring systems (45% of studies), learning analytics (18%), automated assessment (12%), computer vision (8%), educational robotics (7%), multimodal sensing (6%), AI-enhanced extended reality (XR) (4%), and adaptive content generation. The analysis shows that most studies focus on upper elementary grades (65%) and mathematics (38%), with limited cross-disciplinary STEM integration (15%). While conversational AI demonstrates moderate effectiveness (d = 0.45-0.70 where reported), only 34% of studies include standardized effect sizes. Eight major gaps limit real-world impact: fragmented ecosystems, developmental inappropriateness, infrastructure barriers, lack of privacy frameworks, weak STEM integration, equity disparities, teacher marginalization, and narrow assessment scopes. Geographic distribution is also uneven, with 90% of studies originating from North America, East Asia, and Europe. Future directions call for interoperable architectures that support authentic STEM integration, grade-appropriate design, privacy-preserving analytics, and teacher-centered implementations that enhance rather than replace human expertise.


ROBoto2: An Interactive System and Dataset for LLM-assisted Clinical Trial Risk of Bias Assessment

Hevia, Anthony, Chintalapati, Sanjana, Lai, Veronica Ka Wai, Nguyen, Thanh Tam, Wong, Wai-Tat, Klassen, Terry, Wang, Lucy Lu

arXiv.org Artificial Intelligence

We present ROBOTO2, an open-source, web-based platform for large language model (LLM)-assisted risk of bias (ROB) assessment of clinical trials. ROBOTO2 streamlines the traditionally labor-intensive ROB v2 (ROB2) annotation process via an interactive interface that combines PDF parsing, retrieval-augmented LLM prompting, and human-in-the-loop review. Users can upload clinical trial reports, receive preliminary answers and supporting evidence for ROB2 signaling questions, and provide real-time feedback or corrections to system suggestions. ROBOTO2 is publicly available at https://roboto2.vercel.app/, with code and data released to foster reproducibility and adoption. We construct and release a dataset of 521 pediatric clinical trial reports (8954 signaling questions with 1202 evidence passages), annotated using both manually and LLM-assisted methods, serving as a benchmark and enabling future research. Using this dataset, we benchmark ROB2 performance for 4 LLMs and provide an analysis into current model capabilities and ongoing challenges in automating this critical aspect of systematic review.